Handling of duplicate keys in objects #85

Merged · 4 commits · Dec 21, 2023
Conversation

kutsurak (Contributor)

Here is the promised pull request for the handling of duplicate keys. It is just a linear pass over the parsed object, keeping the last value when a key has been seen before. As I mentioned in #84, this will probably have a performance impact if the objects are large, but I have not tested it. I also added a few tests to make sure this change behaves reasonably.

As a general conclusion, I think this change does not play very well with the ordered concept in the library (see the expected values of the two tests I added), but I do believe it is the best we can do with respect to correctness.

For performance we might be able to use a more efficient data structure (a hash table comes to mind), but this would probably require more extensive changes to the parser.
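
To make the idea concrete, here is a minimal standalone sketch of such a linear pass over an alist of parsed key/value pairs, keeping the last value for each duplicate key. It is an illustration only, not the exact code in this pull request, and the helper name keep-last-value is made up for the example.

    ;; Illustration only, not the PR's code: keep the last value for each
    ;; duplicate key in an alist, using a single linear pass.
    (define (keep-last-value pairs)
      (let loop ((rest (reverse pairs)) (res '()))
        (cond ((null? rest) res)               ; nothing left to scan
              ((assoc (caar rest) res)         ; key already kept: skip this pair
               (loop (cdr rest) res))
              (else (loop (cdr rest)           ; first time we see this key
                          (cons (car rest) res))))))

    ;; (keep-last-value '(("a" . 1) ("b" . 2) ("a" . 3)))
    ;; => (("b" . 2) ("a" . 3))   ; note the original key order is not preserved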

@aconchillo (Owner)

Hi @kutsurak. I actually implemented this using a hash table, which should fix the performance issues. I took the liberty of borrowing a couple of lines from your unit tests; the other line was not necessary because, now that we use a hash table, we can no longer guarantee the order. Let me know what you think. Thank you!!!! 🙏
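
For illustration, a minimal sketch of the hash-table idea in Guile could look like the following; the helper name dedup-pairs is assumed for the example and this is not the actual guile-json code.

    ;; Sketch only, not guile-json's implementation: de-duplicate an alist
    ;; of parsed key/value pairs with a hash table, keeping the last value
    ;; seen for each key.  Key order is not preserved.
    (define (dedup-pairs pairs)
      (let ((table (make-hash-table)))
        (for-each (lambda (pair)
                    ;; later pairs overwrite earlier ones for the same key
                    (hash-set! table (car pair) (cdr pair)))
                  pairs)
        (hash-map->list cons table)))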

@aconchillo (Owner)

Oooops, forgot to mention #86.

@aconchillo (Owner)

I tested this PR with a large file and got these results:

scheme@(guile-user)> ,t (->bool (call-with-input-file "large-file.json" (lambda (port) (json->scm port))))
$8 = #t
;; 1.261434s real time, 1.657068s run time.  0.465849s spent in GC.

So it doesn't seem to have a big impact, which is nice. Actually, it seems to behave better than the hash-table approach (see #86 (comment)). I need to test bigger files, since the one I tested was an array of objects and the objects were not too big. The hash-table approach would probably be better for really big objects, but I don't think that's usually the case.

@kutsurak (Contributor, Author)

Hello @aconchillo

Yes, I agree that for large objects a hash table will perform better, but for small objects the setup overhead becomes expensive. I am fine with either solution, or even with a combination of the two, where a keyword argument lets users switch between the recursive and the hash-table solutions depending on their data.

Let me know if you want me to do some more work on this.
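
As a purely hypothetical illustration of that combined approach (guile-json does not expose such an option; the name uniquify and the #:strategy keyword are invented here, and the sketch reuses the keep-last-value and dedup-pairs examples above):

    ;; Hypothetical sketch only -- not part of guile-json's API.
    (define* (uniquify pairs #:key (strategy 'linear))
      (case strategy
        ((linear)     (keep-last-value pairs))  ; recursive/linear pass (sketch above)
        ((hash-table) (dedup-pairs pairs))      ; hash-table pass (sketch above)
        (else (error "unknown strategy:" strategy))))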

Review comment on json/parser.scm (outdated)
(cond ((null? pairs) res)
      ((assoc (caar pairs) res)
       (uniquify-keys (cdr pairs) res))
      (#t (uniquify-keys (cdr pairs) (cons (car pairs) res))))
aconchillo (Owner)

You can use else here instead of #t; I think it's a bit more idiomatic.
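
For reference, the excerpt with else in place of #t would read:

    (cond ((null? pairs) res)
          ((assoc (caar pairs) res)
           (uniquify-keys (cdr pairs) res))
          (else (uniquify-keys (cdr pairs) (cons (car pairs) res))))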

@aconchillo (Owner)

Just added one tiny comment. After that I will merge it. Thank you!

aconchillo merged commit 05ce74c into aconchillo:master on Dec 21, 2023
2 checks passed